SQL Extensions for Domain Agnostic Data Representational Consistency

Author

  • Bingyu Yi
Abstract

The explosion of big data has heightened the importance of data quality in decision support systems and data warehouses. Data inconsistency, a significant data quality problem especially in heterogeneous databases, has mainly been studied from three aspects: data integrity, semantic inconsistency, and representational inconsistency. Data integrity has been well researched and is implemented in DBMSs and data warehouses. Methods for detecting and resolving semantic and representational inconsistencies have also been developed, but only within specific contexts; for a general data quality context, methods that address domain-agnostic data inconsistency are lacking. Historically, data representational inconsistency has been handled in the pre-processing stage of data cleansing frameworks and data quality tools. However, because these tools address the problem within a particular context or domain, users must supply specific information about the data, such as master data and data dependencies, before the inconsistencies can be resolved. This thesis focuses on domain-agnostic data representational inconsistency in a general data quality context in relational databases. We take a declarative approach that introduces SQL extensions, so that users do not have to write large amounts of code. To improve data representational consistency, we propose a user-driven, pattern-based framework that combines an iterative and interactive approach with string pattern matching. There are three main subtasks: (a) design a complete and nearly mutually exclusive pattern library, (b) detect all possible patterns for each record in the target column, and (c) unify the inconsistent data records. We then improve the pattern detection algorithms for inconsistent data records through a modified DFA (deterministic finite automaton), and conduct comprehensive experiments to verify the accuracy and efficiency of the proposed approaches. The evaluation results show that the proposed methods outperform the naive solution. Finally, we implement a toolkit based on the proposed framework and methods.
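
The three subtasks named in the abstract (pattern library, per-record pattern detection, unification) can be illustrated with a minimal sketch. This is not the thesis's method: it uses Python's standard `re` module in place of the SQL extensions and the modified DFA, and the phone-number patterns, the names `PATTERNS`, `TARGET_FORMAT`, `detect_patterns`, and `unify`, and the sample records are all assumptions made up for illustration.

```python
import re

# (a) Hypothetical pattern library for one target column (phone numbers).
# Each pattern captures the same three named parts so records can be unified.
PATTERNS = {
    "dashed": re.compile(r"(?P<a>\d{3})-(?P<b>\d{3})-(?P<c>\d{4})"),
    "dotted": re.compile(r"(?P<a>\d{3})\.(?P<b>\d{3})\.(?P<c>\d{4})"),
    "parens": re.compile(r"\((?P<a>\d{3})\) ?(?P<b>\d{3})-(?P<c>\d{4})"),
    "plain": re.compile(r"(?P<a>\d{3})(?P<b>\d{3})(?P<c>\d{4})"),
}

# Assumed target representation chosen by the user.
TARGET_FORMAT = "{a}-{b}-{c}"


def detect_patterns(value):
    """(b) Return the names of all library patterns that match the whole value."""
    return [name for name, rx in PATTERNS.items() if rx.fullmatch(value)]


def unify(value):
    """(c) Rewrite a value into the target format, or None if no pattern matches."""
    for rx in PATTERNS.values():
        match = rx.fullmatch(value)
        if match:
            return TARGET_FORMAT.format(**match.groupdict())
    return None


# Made-up sample records from a single column.
records = ["555-123-4567", "(555) 123-4567", "555.123.4567", "5551234567", "n/a"]
unified = [unify(r) for r in records]
```

Records that match no pattern (here `"n/a"`) come back as `None`, which is where a user-driven, iterative workflow would presumably prompt the user to extend the pattern library.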

Similar resources

Caching with 'Good Enough' Currency, Consistency, and Completeness

SQL extensions that allow queries to explicitly specify data quality requirements in terms of currency and consistency were proposed in an earlier paper. This paper develops a data quality-aware, finer grained cache model and studies cache design in terms of four fundamental properties: presence, consistency, completeness and currency. The model provides an abstract view of the cache to the que...

HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources

HGVbase (Human Genome Variation database; http://hgvbase.cgb.ki.se, formerly known as HGBASE) is an academic effort to provide a high quality and non-redundant database of available genomic variation data of all types, mostly comprising single nucleotide polymorphisms (SNPs). Records include neutral polymorphisms as well as disease-related mutations. Online search tools facilitate data interrog...

Microsoft Word - VLDB2005full.DOC

SQL extensions that allow queries to explicitly specify data quality requirements in terms of currency and consistency were proposed in an earlier paper. This paper develops a data quality-aware, finer grained cache model and studies cache design in terms of four fundamental properties: presence, consistency, completeness and currency. Such a model provides an abstract view of the cache to the ...

SQL á la Carte - Toward Tailor-made Data Management

The size of the structured query language (SQL) continuously increases. Extensions of SQL for special domains like stream processing or sensor networks come with own extensions, more or less unrelated to the standard. In general, underlying DBMS support only a subset of SQL plus vendor specific extensions. In this paper, we analyze application domains where special SQL dialects are needed or ar...

Microsoft Word - VLDB471-new-7.DOC

SQL extensions that allow queries to explicitly specify data quality requirements in terms of currency and consistency were proposed in an earlier paper. This paper develops a data quality-aware, finer grained cache model and studies cache design in terms of four fundamental properties: presence, consistency, completeness and currency. The model provides an abstract view of the cache to the que...

Publication date: 2016